[24.0] Script for deleting userless histories from database #18058
Conversation
Looks good, but maybe move this to lib/galaxy/model/ and create a normal entrypoint so the script becomes installable? Then you can drop the shell script and the manual entrypoint, and it's one step less to create a Celery task if we choose to do that.
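(For concreteness, an installable console entrypoint could be declared roughly like the snippet below; the command name and module path are guesses for illustration, not something specified in this PR.)

```ini
; Hypothetical console-script declaration in the package's setup.cfg;
; the command name and module path are assumptions, not from this PR.
[options.entry_points]
console_scripts =
    galaxy-delete-userless-histories = galaxy.model.scripts.delete_userless_histories:main
```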
You mean like the scripts in the data package (e.g. …)? What are the criteria we use to decide when a script goes into the …
I don't think we have one. If I had to create one now, I would say things that use lots of Galaxy internals should probably go into the package they primarily deal with? A further consideration would be whether this might be used in an installed Galaxy context, where the scripts aren't usually available.
Thanks! This makes sense.
Superseded by #18079 (that one targets the dev branch)
Ref #17725
This is a draft. The script needs to get `db_url` from config; `batch_size` and `max_create_time` need to be passed as arguments.

Included: script and test (with infrastructure, partially borrowed from not-yet-merged #17662).
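For illustration, the argument handling still to be done might look roughly like this. The option names are assumptions, not the PR's actual interface; `db_url` would come from the Galaxy config rather than the command line.

```python
# Minimal sketch of hypothetical argument handling; option names are
# assumptions. db_url would be read from the Galaxy config instead.
import argparse
import datetime


def parse_args():
    parser = argparse.ArgumentParser(description="Delete userless histories")
    parser.add_argument(
        "--batch-size", type=int, default=1000,
        help="number of rows to process per batch",
    )
    parser.add_argument(
        "--max-create-time", type=datetime.datetime.fromisoformat, required=True,
        help="only process histories created before this timestamp",
    )
    return parser.parse_args()
```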
I've tested this manually + using the included test, on a PostgreSQL database. I've also (partially) verified that the algorithm should be able to handle the data volume in main's database (including the `history_dataset_association` table, which was the main offender).

Here's what the test does:
I've created a new directory under `scripts` for this and similar scripts.

What this script does and why
My first attempts, as discussed in #17725, were SQL-only. However, based on my testing, I think the size of the tables makes that approach infeasible. The main issue is selecting the history ids to delete while excluding the ones that are referred to from tables whose records we shouldn't be deleting (such as job, hda, etc.): that can be done via multiple clauses like `where id not in (select history_id from another-table)`. Given the table size, this won't work.

But the operations are trivial - there's no need for any joins. Also, since this deals with records that are not used, we don't need locking (i.e., old histories only + we mark them as deleted and purged BEFORE selecting the set to delete). Therefore, we don't need the database for anything other than providing us with several (very large) lists of integers that we can then manipulate in memory as sets, which makes it fast enough to be doable.
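For concreteness, here is a minimal sketch of that set-difference idea using plain SQLAlchemy Core; the engine URL, cutoff, and the exact list of referencing tables are illustrative assumptions, not the script's actual values.

```python
# Sketch of the in-memory set-difference approach: pull id lists from the
# database and do the exclusion as plain Python set arithmetic.
# Engine URL, cutoff, and the table list below are illustrative.
import datetime

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/galaxy")
cutoff = datetime.datetime(2024, 1, 1)

with engine.connect() as conn:
    # Candidate set: old histories that belong to no user.
    candidates = {
        row[0]
        for row in conn.execute(
            text("SELECT id FROM history WHERE user_id IS NULL AND create_time < :cutoff"),
            {"cutoff": cutoff},
        )
    }
    # Exclusion set: history ids referenced from tables whose records
    # we shouldn't be deleting (illustrative subset of those tables).
    referenced = set()
    for table in ("job", "history_dataset_association"):
        referenced |= {
            row[0]
            for row in conn.execute(text(f"SELECT history_id FROM {table}"))
            if row[0] is not None
        }

# Plain set arithmetic replaces the infeasible "NOT IN (subquery)" clauses.
to_delete = candidates - referenced
```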
The algorithm: (using `copy` instead of batch insert would be faster, but that requires elevated permissions, so I skipped it; insert should be sufficient.)

We do the above in batches, with the batch size configurable. The reason for this is the hda table: a simple `select history_id from history_dataset_association` appears to be problematic (my process was repeatedly killed after an hour). If there is some PostgreSQL or Unix magic that can help with this, we don't need batches; but batches are simple and work fine.

Comments and harsh critique are certainly most welcome!
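As a postscript on the batching: one simple shape for it is keyset pagination on the primary key, sketched below. The batch size is an arbitrary example value, and this is not necessarily how the script itself pages through the table.

```python
# Illustrative keyset pagination over the hda table: read history_id in
# primary-key order, batch_size rows at a time, so no single huge result
# set is ever materialized. Batch size here is an arbitrary example.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/galaxy")
referenced: set[int] = set()
last_id, batch_size = 0, 100_000

with engine.connect() as conn:
    while True:
        rows = conn.execute(
            text(
                "SELECT id, history_id FROM history_dataset_association"
                " WHERE id > :last_id ORDER BY id LIMIT :limit"
            ),
            {"last_id": last_id, "limit": batch_size},
        ).fetchall()
        if not rows:
            break
        last_id = rows[-1][0]
        referenced |= {hid for _, hid in rows if hid is not None}
```

Keyset pagination (`WHERE id > :last_id`) stays fast on large tables where `OFFSET`-based paging would degrade, since each batch is an index range scan.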